Dog Bites in New York Dataset Exercise

Tijana Blagojev - R-Ladies Belgrade

Aim of the Exercise

  • We will get acquainted with how R works

  • We will learn about different types of variables

  • We will just scratch the surface of several R packages, such as parts of the tidyverse (dplyr and ggplot2)

  • We will create a dashboard with the information contained in the dog bites dataset

Exercise

First steps

  • After installing R and RStudio, you need to set a working directory where all your work will be stored.

  • The best way to do this is to choose File/New Project, which will automatically store all your information in the same place.

  • Since we have already opened the DogsofNewYork.Rproj file, the working directory is already set for us.

R Interface

Packages and Libraries

When you install R, you have basic functions already available within Base R. You can take a look at Introduction to Base R for additional information.

However, in order to access functions or data written by other people, there are numerous R packages available.

An R package is a bundle of functions (code), data, documentation, and vignettes (examples).

Important note - R is case-sensitive so make sure to check spelling and capitalization!

Packages and Libraries-Code

To use the functions in an R package, the package first needs to be installed and then loaded with library(). Use the following code to install packages and load libraries.
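The code chunk from this slide is not preserved here. A minimal sketch of the two steps, assuming the packages named elsewhere in this exercise (tidyverse, flexdashboard, DT and plotly) are the ones needed:

```r
# Install once per machine (uncomment to run; needs an internet connection).
# install.packages(c("tidyverse", "flexdashboard", "DT", "plotly"))

# Load in every new R session before using the packages' functions.
library(tidyverse)      # includes dplyr, ggplot2 and readr
library(flexdashboard)
library(DT)
library(plotly)
```

Installing is done once, while library() has to be called again each time R is restarted.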

Simple use of R

Type the following command in your console and press Enter.
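The command itself is missing from this slide. As an assumption, a simple sum that gives the result printed here would be:

```r
2 + 2   # R evaluates the expression and prints the result in the console
```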

## [1] 4

You use <- to create objects in R. It is called the assignment operator.
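The original example is not preserved. A minimal sketch, assuming an object with the hypothetical name x that holds the value printed on this slide:

```r
x <- 10 + 5   # store the result in an object named x (hypothetical name)
x             # typing the object's name prints the stored value
```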

## [1] 15

Dataset

The dataset on dog bites is taken from the R package nycdogs by Kieran Healy. For our exercise, it has been adapted to include only the year 2017 and several variables. So let us see what the dataset looks like.

Important note: You will rarely come across a dataset that is already prepared for analysis. Usually, you will spend between 50% and 80% of your time on cleaning and preparing data.

Importing a dataset

First, we will import and inspect a csv file about dog bites in New York City for 2017 with the following code.
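The import code is not preserved here. The sketch below assumes read_csv() from the readr package and a hypothetical object name datadogs. Because the workshop file name is unknown, it first writes two rows (taken from the output shown in this exercise) to a temporary file so that the example runs anywhere; in the workshop you would pass the path of the real csv file instead.

```r
library(readr)

# Stand-in for the workshop file: two sample rows in a temporary csv file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "date_of_bite,breed,gender,spay_neuter,borough,zip_code",
  "2017-01-02,Labrador Retriever Crossbreed,Male,No,Brooklyn,11231",
  "2017-01-02,Lhasa Apso,Male,Yes,Brooklyn,11211"
), csv_path)

datadogs <- read_csv(csv_path)  # read_csv() guesses the type of each column
datadogs                        # print the imported data
```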

Inspecting dataset

There are 3072 rows, which we will refer to as observations, and 6 columns, which we will call variables. As you can see, the variables have different types, such as character, date, and double (continuous).

## Observations: 3,072
## Variables: 6
## $ date_of_bite <date> 2017-01-02, 2017-01-02, 2017-01-04, 2017-01-07, 2017-01…
## $ breed        <chr> "Labrador Retriever Crossbreed", "Lhasa Apso", "Pit Bull…
## $ gender       <chr> "Male", "Male", "Unknown", "Unknown", "Male", "Unknown",…
## $ spay_neuter  <chr> "No", "Yes", "No", "No", "Yes", "No", "No", "No", "No", …
## $ borough      <chr> "Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn", "Brookly…
## $ zip_code     <dbl> 11231, 11211, 11219, 11216, 11216, 11229, 11216, 11206, …
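A summary like the one above can be produced with glimpse() from dplyr. The sketch below uses only two sample rows and a hypothetical object name datadogs, so its row count will differ from the 3,072 shown here.

```r
library(dplyr)

# Two sample rows standing in for the full dataset
datadogs <- tibble(
  date_of_bite = as.Date(c("2017-01-02", "2017-01-02")),
  breed        = c("Labrador Retriever Crossbreed", "Lhasa Apso"),
  gender       = c("Male", "Male"),
  spay_neuter  = c("No", "Yes"),
  borough      = c("Brooklyn", "Brooklyn"),
  zip_code     = c(11231, 11211)
)

glimpse(datadogs)  # one line per variable: name, type, first values
dim(datadogs)      # number of rows, then number of columns
```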

Variables

Measured variables have their outcomes expressed in numerical terms (numeric):

  • Integer: Age, number of kittens

  • Double (Continuous): Height, weight

Attribute variables have their outcomes described in terms of their characteristics or attributes:

  • Character: Black, yellow, white

  • Factor (Ordinal): Cold, mild, warm, hot
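The four types listed above can be tried out directly in the console. This is a small sketch with made-up values; the object names are hypothetical.

```r
age     <- 3L         # integer (the L suffix marks a whole number)
weight  <- 12.5       # double
colour  <- "black"    # character
weather <- factor("mild",
                  levels = c("cold", "mild", "warm", "hot"),
                  ordered = TRUE)   # factor with ordered levels

typeof(age)      # storage type of a whole number
typeof(weight)   # storage type of a decimal number
class(colour)    # class of a piece of text
levels(weather)  # the allowed values of the factor, in order
```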

Creating R markdown dashboard presentation

In the top-left corner, click the document icon with the plus sign and choose R Markdown. Then open the Flex Dashboard template.

Flexdashboard Template

Setting up the Appearance of Flexdashboard

Pipe operator

The tidyverse provides a so-called “pipe” operator, %>%. It passes the result of the left-hand side as the first argument of the function on the right-hand side. It is used to chain multiple operations on data together.
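To see what the pipe does, the sketch below writes the same calculation twice, once nested and once piped. The borough names are a made-up sample.

```r
library(dplyr)   # makes the %>% operator available

boroughs <- c("Queens", "Brooklyn", "Queens", "Bronx")

# Without the pipe: read from the inside out
nested <- length(unique(boroughs))

# With the pipe: read left to right, each result feeds the next function
piped <- boroughs %>% unique() %>% length()

nested == piped   # both versions give the same answer
```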

Setup part of the R-markdown-Dashboard Document

In the Setup code chunk, we will import the dog bites dataset and create a subset with the number of bites per borough, which we will use in the textual part of our dashboard.

Number of Bites per Borough in New York

Now let us take a look at the 5 boroughs with the highest number of bites.
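The subset code is not preserved here, so the following is a sketch using dplyr with hypothetical object names. To make it runnable without the csv file, it rebuilds one row per bite from the totals printed on this slide. The sixth group and its label "Other" are assumptions: they stand for the 112 rows left over when the five boroughs are subtracted from 3072.

```r
library(dplyr)

# One row per bite, rebuilt from the totals on this slide.
# "Other" (112 rows) is an assumed label for the remaining observations.
datadogs <- tibble(
  borough = rep(
    c("Queens", "Brooklyn", "Manhattan", "Bronx", "Staten Island", "Other"),
    times = c(817, 690, 663, 506, 284, 112)
  )
)

bites_borough <- datadogs %>%
  count(borough, sort = TRUE) %>%             # bites per borough, largest first
  mutate(perc = round(n / sum(n) * 100)) %>%  # share of all bites, in percent
  slice_head(n = 5)                           # keep the top five

bites_borough
```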

## # A tibble: 5 x 3
##   borough           n  perc
##   <chr>         <int> <dbl>
## 1 Queens          817    27
## 2 Brooklyn        690    22
## 3 Manhattan       663    22
## 4 Bronx           506    16
## 5 Staten Island   284     9

Textual part of the dashboard

We will use inline R code: a backtick (`), followed by r and an expression, closed with another backtick. The result is inserted into the text automatically, so if we use a subset for another year, the text updates straight away. To access a particular value in a dataset, you can use the following code, where the first number is the row and the second is the column.
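The original chunk is missing. A sketch of row and column indexing, assuming a hypothetical object bites_borough that holds the borough counts:

```r
library(dplyr)

bites_borough <- tibble(
  borough = c("Queens", "Brooklyn", "Manhattan"),
  n       = c(817L, 690L, 663L)
)

bites_borough[1, 1]    # row 1, column 1: a 1 x 1 tibble holding "Queens"
bites_borough[[1, 1]]  # double brackets return the bare value "Queens"

# In the dashboard text, the same expression goes between backticks, e.g.:
# Most bites were recorded in `r bites_borough[[1, 1]]`.
```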

## # A tibble: 1 x 1
##   borough
##   <chr>  
## 1 Queens

Textual part of the dashboard-Code

Textual part of the dashboard result

Congratulations, you just coded and knitted your first dashboard!!!

Creating a Searchable Datatable

First, in the Setup part of our dashboard document, we will create a table without the last column, which contains zip codes.

Now we will add a searchable table just below the title Table of Dog Bites in New York in 2017 with the help of the DT package.

Dashboard progress
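Neither chunk is preserved here. The two steps could look roughly like this, assuming select() from dplyr and datatable() from DT, with hypothetical object names and two sample rows in place of the full dataset:

```r
library(dplyr)
library(DT)

# Two sample rows standing in for the full dataset
datadogs <- tibble(
  breed       = c("Labrador Retriever Crossbreed", "Lhasa Apso"),
  gender      = c("Male", "Male"),
  spay_neuter = c("No", "Yes"),
  borough     = c("Brooklyn", "Brooklyn"),
  zip_code    = c(11231, 11211)
)

# Setup part: drop the zip code column
datadogs_table <- datadogs %>% select(-zip_code)

# Dashboard part: an interactive table with a search box
bites_widget <- datatable(datadogs_table, rownames = FALSE)
bites_widget
```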

Creating a Bar Chart

First, we will create a subset to see which three breeds bite the most. We will again put this code in the first Setup part of our R dashboard/R Markdown file.
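A sketch of such a subset with dplyr follows. The rows are illustrative only, not the real counts, and the object names are hypothetical.

```r
library(dplyr)

# Illustrative rows only, not the real counts
datadogs <- tibble(
  breed = rep(c("Pit Bull", "Unknown", "Shih Tzu", "Lhasa Apso"),
              times = c(4, 3, 2, 1))
)

top_breeds <- datadogs %>%
  count(breed, sort = TRUE) %>%  # bites per breed, largest first
  slice_head(n = 3)              # keep the three most frequent breeds

top_breeds
```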

Using ggplot and plotly

We will use two packages: ggplot2 to make a bar graph and plotly to make the graph’s information pop up when hovering. ggplot2 is a package created by Hadley Wickham that is based on the grammar of graphics.

Grammar of Graphics

The grammar of graphics lets you specify the building blocks of a plot and combine them to create the graphical display you want.

  • data

  • aesthetic mapping

  • geometric object

  • statistical transformations

  • scales

  • coordinate system

  • position adjustments

  • faceting
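The building blocks above map onto ggplot2 functions roughly as in this sketch. The counts are the known-gender rows of the spay/neuter table in this exercise; the object names are hypothetical.

```r
library(ggplot2)

bites <- data.frame(
  spay_neuter = c("No", "No", "Yes", "Yes"),
  gender      = c("Female", "Male", "Female", "Male"),
  n           = c(271, 682, 290, 755)
)

p <- ggplot(bites,                                # data
            aes(x = gender, y = n)) +             # aesthetic mapping
  geom_col(position = "stack") +                  # geometric object + position adjustment
  scale_y_continuous(name = "Number of bites") +  # scale
  coord_flip() +                                  # coordinate system
  facet_wrap(~ spay_neuter)                       # faceting

p
```

geom_col() plots the values as given, so the statistical transformation here is the identity. geom_bar() would count the rows instead.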

Creating bar graph

Instead of Chart B, we will write: Three breeds with the highest number of bites in 2017, and use the following code for a bar chart.
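The chart code is not preserved here. A minimal sketch with ggplot2 and plotly, using illustrative counts and hypothetical object names:

```r
library(ggplot2)
library(plotly)

# Illustrative counts only; the real values come from the top-three subset
top_breeds <- data.frame(
  breed = c("Pit Bull", "Unknown", "Shih Tzu"),
  n     = c(4, 3, 2)
)

bar <- ggplot(top_breeds, aes(x = breed, y = n)) +
  geom_col() +
  labs(x = "Breed", y = "Number of bites")

bar_widget <- ggplotly(bar)  # turns the static plot into one with hover pop-ups
bar_widget
```

Because breed is a character variable, ggplot2 places the bars in alphabetical order rather than by number of bites.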

Bar Chart

As you can see, the order of the bars is not quite right.

Changing Character Variable to a Factor

In the Setup part of our file, we will add the following code, which transforms the character variable breed into a factor. We will specify levels so that the dog breeds are ordered by number of bites: the first level will be Pit Bull, followed by Unknown and then Shih Tzu.

Note: In order to use a particular column/variable in R, we join the dataset name and the column name with the dollar sign ($).

Dashboard Progress
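A sketch of this conversion with base R's factor(), assuming a hypothetical data frame top_breeds with a breed column:

```r
top_breeds <- data.frame(
  breed = c("Shih Tzu", "Pit Bull", "Unknown"),
  stringsAsFactors = FALSE
)

# dataset$column selects one column; factor() fixes the order of its values
top_breeds$breed <- factor(top_breeds$breed,
                           levels = c("Pit Bull", "Unknown", "Shih Tzu"))

levels(top_breeds$breed)  # the order that ggplot2 will use for the bars
```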

Final Stage - Braaavoo!!!!

Stacked Bar of Spayed/Neutered Dogs

In this final part, we will create a stacked bar chart that shows how many of the dogs that bit were spayed/neutered and how many of them were male or female. In the Setup part, we will again create a subset, this time grouped by spay/neuter and gender. We will also create another column to use as a pop-up label.
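The subset code is missing here. The sketch below assumes dplyr and rebuilds one row per bite from the six counts listed in this exercise, so that it runs without the csv file. The input object name datadogs is hypothetical.

```r
library(dplyr)

group_sizes <- c(271, 682, 1063, 290, 755, 11)

# One row per bite, rebuilt from the group counts
datadogs <- tibble(
  spay_neuter = rep(c("No", "No", "No", "Yes", "Yes", "Yes"), times = group_sizes),
  gender      = rep(c("Female", "Male", "Unknown", "Female", "Male", "Unknown"),
                    times = group_sizes)
)

datadogsgenderspay <- datadogs %>%
  count(spay_neuter, gender) %>%  # bites per spay/neuter and gender group
  mutate(Info = paste("<br>", "Spay/Neuter:", spay_neuter,
                      "<br>", "Number of bites:", n,
                      "<br>", "Gender:", gender, "<br>"))  # pop-up label

datadogsgenderspay
```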

The datadogsgenderspay subset

spay_neuter  gender   n     Info
No           Female   271   <br> Spay/Neuter: No <br> Number of bites: 271 <br> Gender: Female <br>
No           Male     682   <br> Spay/Neuter: No <br> Number of bites: 682 <br> Gender: Male <br>
No           Unknown  1063  <br> Spay/Neuter: No <br> Number of bites: 1063 <br> Gender: Unknown <br>
Yes          Female   290   <br> Spay/Neuter: Yes <br> Number of bites: 290 <br> Gender: Female <br>
Yes          Male     755   <br> Spay/Neuter: Yes <br> Number of bites: 755 <br> Gender: Male <br>
Yes          Unknown  11    <br> Spay/Neuter: Yes <br> Number of bites: 11 <br> Gender: Unknown <br>

Creating stacked bar graph

Instead of Chart C, we will write and center the title: Bites based on dog’s gender and whether they were spayed/neutered {align=center} and use the following code for a stacked bar:
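The chart code is not preserved here. A minimal sketch with ggplot2 and plotly, assuming the Info column is passed to the pop-up through the text aesthetic (object names other than datadogsgenderspay are hypothetical):

```r
library(ggplot2)
library(plotly)

datadogsgenderspay <- data.frame(
  spay_neuter = c("No", "No", "No", "Yes", "Yes", "Yes"),
  gender      = c("Female", "Male", "Unknown", "Female", "Male", "Unknown"),
  n           = c(271, 682, 1063, 290, 755, 11)
)
datadogsgenderspay$Info <- paste(
  "<br>", "Spay/Neuter:", datadogsgenderspay$spay_neuter,
  "<br>", "Number of bites:", datadogsgenderspay$n,
  "<br>", "Gender:", datadogsgenderspay$gender, "<br>"
)

stacked <- ggplot(datadogsgenderspay,
                  aes(x = spay_neuter, y = n, fill = gender, text = Info)) +
  geom_col() +   # bars that share an x value are stacked by default
  labs(x = "Spayed/neutered", y = "Number of bites", fill = "Gender")

stacked_widget <- ggplotly(stacked, tooltip = "text")  # show only Info on hover
stacked_widget
```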

Stacked Bar

Dashboard Completed

Word of Caution in this Tale

  • “We infer that something we see in the data applies beyond the time, place and conditions in which it happened to surface.” Ben Jones, Avoiding Data Pitfalls.

  • In order to say that Pit Bulls are really aggressive, we would need to do additional research.

  • Is it relevant to make conclusions with this number of observations? Is the data reliable?

  • That is why experts need to be able to create this type of visualisation. They already have the expertise needed to draw valid conclusions, and this tool can help them reach a wider audience as well as follow and contribute to other people’s work.

Great Work and Thank you!